- Multiple regression
- Fitting a multiple regression
- Drawing inferences from multiple regression
- Examine what variables should be in (or out of) the model
- Diagnostics on multivariate models
2015-07-25
After this class you will be able to
Last class we did this with a single-predictor model; now we will do it with a two-predictor model.
The mean function is . . .
When you have two continuous predictors \(x_1\), \(x_2\), then your mean function is . . .
a plane
Here is some data from the GSS.
suppressPackageStartupMessages(library(dplyr))
load("data/gss_2010_training.RData")
gss.training <- tbl_df(gss.training)
gss <- select(gss.training, income06_n, educ, maeduc, paeduc) %>%
filter(!is.na(income06_n), !is.na(educ), !is.na(maeduc), !is.na(paeduc))
# NOTE: DROPPING MISSING DATA LIKE THIS CAN BE DANGEROUS
gss <- rename(gss, income = income06_n)
suppressPackageStartupMessages(library(GGally))
pm <- ggpairs(select(gss, educ, paeduc, income))
pm
suppressPackageStartupMessages(library(scatterplot3d))
scatterplot3d(x=gss$educ, y=gss$paeduc, z=gss$income,
xlab="Education", ylab="Father's education",
zlab="Income category", pch=20, angle=20)
scatterplot3d(x=gss$educ, y=gss$paeduc, z=gss$income,
xlab="Education", ylab="Father's education",
zlab="Income category", pch=20, angle=40)
scatterplot3d(x=gss$educ, y=gss$paeduc, z=gss$income,
xlab="Education", ylab="Father's education",
zlab="Income category", pch=20, angle=60)
scatterplot3d(x=gss$educ, y=gss$paeduc, z=gss$income,
xlab="Education", ylab="Father's education",
zlab="Income category", pch=20, angle=80)
s3d <- scatterplot3d(x=gss$educ, y=gss$paeduc, z=gss$income,
xlab="Education", ylab="Father's education",
zlab="Income category", pch=20, angle=20)
my.lm <- lm(gss$income ~ gss$educ + gss$paeduc)
s3d$plane3d(my.lm)
s3d <- scatterplot3d(x=gss$educ, y=gss$paeduc, z=gss$income,
xlab="Education", ylab="Father's education",
zlab="Income category", pch=20, angle=40)
my.lm <- lm(gss$income ~ gss$educ + gss$paeduc)
s3d$plane3d(my.lm)
s3d <- scatterplot3d(x=gss$educ, y=gss$paeduc, z=gss$income,
xlab="Education", ylab="Father's education",
zlab="Income category", pch=20, angle=60)
my.lm <- lm(gss$income ~ gss$educ + gss$paeduc)
s3d$plane3d(my.lm)
s3d <- scatterplot3d(x=gss$educ, y=gss$paeduc, z=gss$income,
xlab="Education", ylab="Father's education",
zlab="Income category", pch=20, angle=80)
my.lm <- lm(gss$income ~ gss$educ + gss$paeduc)
s3d$plane3d(my.lm)
The regression plane is the plane that minimizes the sum of the squared residuals. The residual is the difference between the actual income and the predicted income for each person in the sample.
\[\mbox{income}_i = \beta_0 + \beta_1 \times \mbox{educ}_i + \beta_2 \times \mbox{paeduc}_i + \mbox{residual}_i\]
\[\widehat{\mbox{income}}_i = \beta_0 + \beta_1 \times \mbox{educ}_i + \beta_2 \times \mbox{paeduc}_i\]
Model is: \[\widehat{\mbox{income}}_i = \beta_0 + \beta_1 \times \mbox{educ}_i + \beta_2 \times \mbox{paeduc}_i\]
\[\beta_1 = \frac{cor(income, educ) - cor(educ, paeduc) \times cor(income, paeduc)}{ (1 - cor(educ, paeduc)^2 ) } \times \frac{SD(income)}{SD(educ)}\]
Fitting multiple regression in R using lm vs. by hand:
library(broom)
fit <- lm(income ~ educ + paeduc, data = gss)
tidy(fit)
         term   estimate  std.error statistic      p.value
1 (Intercept) 8.99273715 0.51670498  17.40401 1.518822e-63
2        educ 0.63472180 0.03950188  16.06814 6.224087e-55
3      paeduc 0.06738516 0.02832285   2.37918 1.743863e-02
numerator <- cor(gss$income, gss$educ) -
  (cor(gss$educ, gss$paeduc) * cor(gss$income, gss$paeduc))
denominator <- 1 - cor(gss$educ, gss$paeduc)^2
numerator / denominator * (sd(gss$income) / sd(gss$educ))
[1] 0.6347218
\[\hat \beta_1 = \frac{\sum_{i=1}^n e_{i, Y | X_2} e_{i, X_1 | X_2}}{\sum_{i=1}^n e_{i, X_1 | X_2}^2}\]

- That is, the regression estimate for \(\beta_1\) is the regression through the origin estimate having regressed \(X_2\) out of both the response and the predictor.
- (Similarly, the regression estimate for \(\beta_2\) is the regression through the origin estimate having regressed \(X_1\) out of both the response and the predictor.)
- More generally, multivariate regression estimates are exactly those having removed the linear relationship of the other variables from both the regressor and response.
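This residualing-out identity (the Frisch-Waugh-Lovell theorem) is easy to check numerically. A minimal sketch on simulated data (not the GSS data; the names x1, x2 are made up for illustration):

```r
# Sketch: the multiple-regression slope on x1 equals the through-the-origin
# slope after regressing x2 out of both the response and x1.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)            # correlated predictors
y  <- 1 + 2 * x1 - 1 * x2 + rnorm(n)

b1.full <- coef(lm(y ~ x1 + x2))["x1"]

e.y  <- resid(lm(y ~ x2))            # y with x2 regressed out
e.x1 <- resid(lm(x1 ~ x2))           # x1 with x2 regressed out
b1.fwl <- coef(lm(e.y ~ e.x1 - 1))   # regression through the origin

all.equal(unname(b1.full), unname(b1.fwl))  # TRUE
```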
\[E[Y | X_1 = x_1, \ldots, X_p = x_p] = \sum_{k=1}^p x_{k} \beta_k\]
\[ E[Y | X_1 = x_1 + 1, \ldots, X_p = x_p] = (x_1 + 1) \beta_1 + \sum_{k=2}^p x_{k} \beta_k \]
\[ E[Y | X_1 = x_1 + 1, \ldots, X_p = x_p] - E[Y | X_1 = x_1, \ldots, X_p = x_p]\] \[= (x_1 + 1) \beta_1 + \sum_{k=2}^p x_{k} \beta_k - \sum_{k=1}^p x_{k} \beta_k = \beta_1 \] So the interpretation of a multivariate regression coefficient is the expected change in the response per unit change in the regressor, holding all of the other regressors fixed.
Model \(Y_i = \sum_{k=1}^p X_{ik} \beta_{k} + \epsilon_{i}\) where \(\epsilon_i \sim N(0, \sigma^2)\)
Fitted responses \(\hat Y_i = \sum_{k=1}^p X_{ik} \hat \beta_{k}\)
Residuals \(e_i = Y_i - \hat Y_i\)
Variance estimate \(\hat \sigma^2 = \frac{1}{n-p} \sum_{i=1}^n e_i ^2\)
To get predicted responses at new values, \(x_1, \ldots, x_p\), simply plug them into the linear model \(\sum_{k=1}^p x_{k} \hat \beta_{k}\)
Coefficients have standard errors, \(\hat \sigma_{\hat \beta_k}\), and \(\frac{\hat \beta_k - \beta_k}{\hat \sigma_{\hat \beta_k}}\) follows a \(T\) distribution with \(n-p\) degrees of freedom.
Predicted responses have standard errors and we can calculate predicted and expected response intervals.
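These quantities are straightforward to compute by hand and check against lm(). A sketch on simulated data (here \(p\) counts the intercept plus the slopes, matching the \(n - p\) divisor above):

```r
# Sketch: fitted values, residuals, and the variance estimate by hand,
# checked against summary.lm().
set.seed(2)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 0.25 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
e   <- y - fitted(fit)               # residuals e_i = Y_i - Yhat_i
p   <- length(coef(fit))             # 3: intercept + two slopes
sigma2.hat <- sum(e^2) / (n - p)     # (1/(n-p)) * sum of e_i^2

all.equal(sqrt(sigma2.hat), summary(fit)$sigma)  # TRUE
```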

A given data set can conceivably have been generated from uncountably many models. Identifying the true model is like finding a piece of hay in a haystack. Said another way, the model space is massive and the criterion for what constitutes the "best" model is ill-defined.
Best strategy: Use domain knowledge to constrain the model space and/or build models that help you answer specific scientific questions.
Another common strategy:
Tread Carefully!!! (particularly in second strategy)
There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don't know. But there are also unknown unknowns. There are things we don't know we don't know. (Donald Rumsfeld)
In our context
(Known knowns) Regressors that we know we should consider for the model, and that we have measured.
(Known unknowns) Regressors that we would like to include in the model, but don't have.
(Unknown unknowns) Regressors that we don't even know about, but that we should have included in the model.
Omitting variables results in bias in the coefficients of interest, unless the included regressors are uncorrelated with the omitted ones.
This is why we randomize treatments: randomization attempts to decorrelate the treatment indicator from the variables we couldn't put in the model.
(If there are too many unobserved confounding variables, even randomization won't help you.)
Including variables that we shouldn't have increases standard errors of the regression variables.
Actually, including any new variables increases (actual, not estimated) standard errors of other regressors. So we don't want to idly throw variables into the model.
The model must tend toward perfect fit as the number of non-redundant regressors approaches \(n\).
\(R^2\) increases monotonically as more regressors are included.
The SSE decreases monotonically as more regressors are included.
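A quick simulated check of this monotonicity: adding even a pure-noise predictor cannot decrease \(R^2\). The variable names below are made up for illustration.

```r
# Sketch: R^2 never decreases when a predictor is added, even unrelated noise.
set.seed(3)
n <- 50
x <- rnorm(n)
y <- 1 + x + rnorm(n)
junk <- rnorm(n)                       # noise "predictor", unrelated to y

r2.small <- summary(lm(y ~ x))$r.squared
r2.big   <- summary(lm(y ~ x + junk))$r.squared
r2.big >= r2.small                     # TRUE: the SSE can only go down
```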
A measure of explanatory power of model:
\[ R^2 = SSreg/SST= 1 - RSS/SST \]
But, like the likelihood, it only goes up as predictors are added, so we add a penalty.
\[ R^2_{adj} = 1 - \frac{RSS/(n - (p + 1))}{SST/(n - 1)} \]
Nonetheless, choosing the model that has the highest \(R^2_{adj}\) can lead to overfitting.
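A sketch verifying the \(R^2_{adj}\) formula against summary.lm() on simulated data; here \(p\) counts the slopes, not the intercept, matching the \(n - (p + 1)\) divisor above.

```r
# Sketch: R^2_adj computed from the formula, checked against lm().
set.seed(4)
n  <- 80
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + x1 + 0.5 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

RSS <- sum(resid(fit)^2)
SST <- sum((y - mean(y))^2)
p   <- 2                               # two slopes
r2.adj <- 1 - (RSS / (n - (p + 1))) / (SST / (n - 1))

all.equal(r2.adj, summary(fit)$adj.r.squared)  # TRUE
```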
Akaike Information Criterion: AIC, a balance of goodness of fit and complexity using information theory.
\[ AIC = 2[-\log(\textrm{likelihood of model}) + (p + 2)] \]
which can be simplified to,
\[ AIC = n \log\left(\frac{RSS}{n}\right) + 2p \]
Smaller = better. Tends to overfit in small sample sizes.
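The simplified form drops additive constants that are the same for every model fit to the same \(n\) observations, so it ranks models exactly as stats::AIC() does. A sketch on simulated data (the helper aic.simple is made up for illustration):

```r
# Sketch: n*log(RSS/n) + 2p differs from stats::AIC() only by an additive
# constant shared by all models on the same data, so rankings agree.
set.seed(5)
n  <- 60
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + x1 + rnorm(n)                # x2 is irrelevant

aic.simple <- function(fit) {
  rss <- sum(resid(fit)^2)
  p   <- length(coef(fit)) - 1         # number of slopes
  n * log(rss / n) + 2 * p
}

f1 <- lm(y ~ x1)
f2 <- lm(y ~ x1 + x2)
aic.simple(f2) - aic.simple(f1)        # same difference as ...
AIC(f2) - AIC(f1)                      # ... from stats::AIC()
```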
AIC Corrected: a bias-corrected version for use on small sample sizes.
\[ AIC_C = AIC + \frac{2(p + 2)(p + 3)}{n - (p + 1)} \]
Bayesian Information Criterion: BIC, for all but the very smallest data sets, takes a heavier penalty for complexity.
\[ BIC = -2 \log(\textrm{likelihood of model}) + (p + 2) \log(n) \]
Def: the joint probability (actually: density) of all of the data given a particular model. If our \(Y\)s are independent of each other given the \(X\), then:
\[ P(Y_. | X_.) = P(y_1 | x_1) P(y_2 | x_2) \ldots P(y_n | x_n) \]
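A sketch checking this on a fitted lm: the log-likelihood is the sum of log normal densities evaluated at the fitted means, using the MLE of \(\sigma^2\) (divide by \(n\), not \(n - p\)), and it matches logLik().

```r
# Sketch: the lm log-likelihood is the sum of pointwise log normal densities.
set.seed(6)
n <- 40
x <- rnorm(n)
y <- 1 + x + rnorm(n)
fit <- lm(y ~ x)

sigma.mle <- sqrt(sum(resid(fit)^2) / n)   # MLE, not the unbiased estimate
ll <- sum(dnorm(y, mean = fitted(fit), sd = sigma.mle, log = TRUE))

all.equal(ll, as.numeric(logLik(fit)))     # TRUE
```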
We will talk about these metrics in lab
The space of models explodes quickly as you add interactions and polynomial terms.
Principal components or factor analytic models on covariates are often useful for reducing complex covariate spaces.
Good design can often eliminate the need for complex model searches at analysis time, though often control over the design is limited.
If the models of interest are nested and without lots of parameters differentiating them, it's fairly uncontroversial to use nested likelihood ratio tests.

library(faraway)
data(seatpos)
head(seatpos)
  Age Weight HtShoes    Ht Seated  Arm Thigh  Leg hipcenter
1  46    180   187.2 184.9   95.2 36.1  45.3 41.3  -206.300
2  31    175   167.5 165.5   83.8 32.9  36.5 35.9  -178.210
3  23    100   153.6 152.2   82.9 26.0  36.6 31.0   -71.673
4  19    185   190.3 187.4   97.3 37.4  44.1 41.0  -257.720
5  23    159   178.0 174.1   93.9 29.5  40.1 36.9  -173.230
6  47    170   178.7 177.0   92.4 36.0  43.2 37.4  -185.150
m1 <- lm(hipcenter ~ ., data = seatpos)  # the dot means 'all other variables'
summary(m1)$coef
                Estimate  Std. Error     t value   Pr(>|t|)
(Intercept) 436.43212823 166.5716187  2.62008697 0.01384361
Age           0.77571620   0.5703288  1.36012113 0.18427175
Weight        0.02631308   0.3309704  0.07950283 0.93717877
HtShoes      -2.69240774   9.7530351 -0.27605845 0.78446097
Ht            0.60134458  10.1298739  0.05936348 0.95306980
Seated        0.53375170   3.7618942  0.14188376 0.88815293
Arm          -1.32806864   3.9001969 -0.34051323 0.73592450
Thigh        -1.14311888   2.6600237 -0.42974011 0.67056106
Leg          -6.43904627   4.7138601 -1.36598163 0.18244531
- The model has difficulty deciding which of them is responsible for the variability in the response.
- Geometrically, the plane is unstable and will vacillate wildly when fit to a new data set.
- The multicollinearity increases the variance of your slope estimates, so it becomes difficult to get good/stable/significant estimates.
cor(d)
           Y        X1        X2
Y  1.0000000 0.7560344 0.7346007
X1 0.7560344 1.0000000 0.3503022
X2 0.7346007 0.3503022 1.0000000
Correlations above 0.7 will often induce considerable variance in your slopes.
The variance of a given slope can be written:
\[ Var(\hat{\beta}_j) = \frac{1}{1 - R^2_j} \times \frac{\sigma^2}{(n - 1) S^2_{x_j}} \]
Where \(R^2_j\) is the \(R^2\) from predicting \(x_j\) using the other predictors and \(S^2_{x_j}\) is the variance of \(x_j\).
The first term is called the VIF: \(1 / (1 - R^2_j)\)
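A sketch computing a VIF straight from this definition on simulated collinear predictors (car::vif() automates the same computation):

```r
# Sketch: VIF = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor x_j on the other predictors.
set.seed(7)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)   # strongly related to x1
y  <- 1 + x1 + x2 + rnorm(n)

r2.1  <- summary(lm(x1 ~ x2))$r.squared  # R^2_1: x1 predicted by x2
vif.1 <- 1 / (1 - r2.1)                  # inflation factor for beta_1
vif.1
```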
library(car)
vif(lm(Y ~ X1 + X2, data = d))
      X1       X2 
1.139876 1.139876 
Rule of thumb: VIFs greater than 5 should be addressed.
m1 <- lm(hipcenter ~ ., data = seatpos)
vif(m1)
       Age     Weight    HtShoes         Ht     Seated        Arm      Thigh        Leg 
  1.997931   3.647030 307.429378 333.137832   8.951054   4.496368   2.762886   6.694291 
Notice any issues? Expected or unexpected?
Multicollinearity suggests that you have multiple predictors that are conveying the same information, so you probably don't need both in the model.
Occam's Razor: since the model with all of the predictors doesn't add any descriptive power, we should favor the simpler model.
Subject-area knowledge is the best way to decide which to remove.
Model selection is about balancing tensions:
Over this course and continuing in 8130, my goal is to give you the confidence and intuition to adeptly navigate these tensions